{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_postprocessor/ColbertRerank.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Colbert Rerank"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.\n",
    "\n",
    "\n",
    "[Colbert](https://github.com/stanford-futuredata/ColBERT): ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.\n",
    "\n",
    "This example shows how we use Colbert-V2 model as a reranker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index\n",
    "!pip install llama-index-core\n",
    "!pip install --quiet transformers torch\n",
    "!pip install llama-index-embeddings-openai\n",
    "!pip install llama-index-llms-openai\n",
    "!pip install llama-index-postprocessor-colbert-rerank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import (\n",
    "    VectorStoreIndex,\n",
    "    SimpleDirectoryReader,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()\n",
    "\n",
    "# build index\n",
    "index = VectorStoreIndex.from_documents(documents=documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retrieve top 10 most relevant nodes, then filter with Colbert Rerank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.postprocessor.colbert_rerank import ColbertRerank\n",
    "\n",
    "colbert_reranker = ColbertRerank(\n",
    "    top_n=5,\n",
    "    model=\"colbert-ir/colbertv2.0\",\n",
    "    tokenizer=\"colbert-ir/colbertv2.0\",\n",
    "    keep_retrieval_score=True,\n",
    ")\n",
    "\n",
    "query_engine = index.as_query_engine(\n",
    "    similarity_top_k=10,\n",
    "    node_postprocessors=[colbert_reranker],\n",
    ")\n",
    "response = query_engine.query(\n",
    "    \"What did Sam Altman do in this essay?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50157136-f221-4468-83e1-44e289f44cd5\n",
      "When I was dealing with some urgent problem during YC, there was about a 60% chance it had to do with HN, and a 40% chan\n",
      "reranking score:  0.6470144987106323\n",
      "retrieval score:  0.8309200279065135\n",
      "**********\n",
      "87f0d691-b631-4b21-8123-8f71d383046b\n",
      "Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020\n",
      "reranking score:  0.6377773284912109\n",
      "retrieval score:  0.8053000783543145\n",
      "**********\n",
      "10234ad9-46b1-4be5-8034-92392ac242ed\n",
      "It's not that unprestigious types of work are good per se. But when you find yourself drawn to some kind of work despite\n",
      "reranking score:  0.6301894187927246\n",
      "retrieval score:  0.7975032272825491\n",
      "**********\n",
      "bc269bc4-49c7-4804-8575-cd6db47d70b8\n",
      "It was as weird as it sounds. I resumed all my old patterns, except now there were doors where there hadn't been. Now wh\n",
      "reranking score:  0.6282549500465393\n",
      "retrieval score:  0.8026253284729862\n",
      "**********\n",
      "ebd7e351-64fc-4627-8ddd-2681d1ac33f8\n",
      "As Jessica and I were walking home from dinner on March 11, at the corner of Garden and Walker streets, these three thre\n",
      "reranking score:  0.6245909929275513\n",
      "retrieval score:  0.7965812262372882\n",
      "**********\n"
     ]
    }
   ],
   "source": [
    "for node in response.source_nodes:\n",
    "    print(node.id_)\n",
    "    print(node.node.get_content()[:120])\n",
    "    print(\"reranking score: \", node.score)\n",
    "    print(\"retrieval score: \", node.node.metadata[\"retrieval_score\"])\n",
    "    print(\"**********\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sam Altman became the second president of Y Combinator after Paul Graham decided to step back from running the organization.\n"
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\n",
    "    \"Which schools did Paul attend?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6942863e-dfc5-4a99-b642-967b99b71343\n",
      "I didn't want to drop out of grad school, but how else was I going to get out? I remember when my friend Robert Morris g\n",
      "reranking score:  0.6333063840866089\n",
      "retrieval score:  0.7964996889742813\n",
      "**********\n",
      "477c5de0-8e05-494e-95cc-e221881fb5c1\n",
      "What I Worked On\n",
      "\n",
      "February 2021\n",
      "\n",
      "Before college the two main things I worked on, outside of school, were writing and pro\n",
      "reranking score:  0.5930159091949463\n",
      "retrieval score:  0.7771872700578062\n",
      "**********\n",
      "0448df5c-7950-483d-bc63-15e9110da3bc\n",
      "[15] We got 225 applications for the Summer Founders Program, and we were surprised to find that a lot of them were from\n",
      "reranking score:  0.5160146951675415\n",
      "retrieval score:  0.7782554326959897\n",
      "**********\n",
      "83af8efd-e992-4fd3-ada4-3c4c6f9971a1\n",
      "Much to my surprise, the time I spent working on this stuff was not wasted after all. After we started Y Combinator, I w\n",
      "reranking score:  0.5005874633789062\n",
      "retrieval score:  0.7800375923908894\n",
      "**********\n",
      "bc269bc4-49c7-4804-8575-cd6db47d70b8\n",
      "It was as weird as it sounds. I resumed all my old patterns, except now there were doors where there hadn't been. Now wh\n",
      "reranking score:  0.4977223873138428\n",
      "retrieval score:  0.782688582042514\n",
      "**********\n"
     ]
    }
   ],
   "source": [
    "for node in response.source_nodes:\n",
    "    print(node.id_)\n",
    "    print(node.node.get_content()[:120])\n",
    "    print(\"reranking score: \", node.score)\n",
    "    print(\"retrieval score: \", node.node.metadata[\"retrieval_score\"])\n",
    "    print(\"**********\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Paul attended Cornell University for his graduate studies and later applied to RISD (Rhode Island School of Design) in the US.\n"
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
